In this project, we’ll explore the red wine dataset for interesting trends. The dataset contains around 1,600 instances; each instance has 11 features and a label giving the quality of the wine as rated by wine experts.
First we load the data and take a look at the summaries.
## [1] "Variables:"
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## [1] "Data frame:"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] "Summaries:"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
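The output above is consistent with calls to `names()`, `str()` and `summary()`. The following is a minimal sketch of that step, not necessarily the code used for this report. So that it runs without the data file, it reads the first ten rows of a few columns (copied from the `str()` output above) from an inline string; the file name mentioned in the comment is an assumption.

```r
# First ten rows of a few columns, copied from the str() output above.
# In the real analysis the data frame would come from a file, e.g.
# read.csv("wineQualityReds.csv"); that file name is an assumption.
csv_text <- "X,fixed.acidity,volatile.acidity,alcohol,quality
1,7.4,0.70,9.4,5
2,7.8,0.88,9.8,5
3,7.8,0.76,9.8,5
4,11.2,0.28,9.8,6
5,7.4,0.70,9.4,5
6,7.4,0.66,9.4,5
7,7.9,0.60,9.4,5
8,7.3,0.65,10.0,7
9,7.8,0.58,9.5,7
10,7.5,0.50,10.5,5
"
wine <- read.csv(text = csv_text)

print("Variables:")
print(names(wine))      # column names

print("Data frame:")
str(wine)               # types and first values of every column

print("Summaries:")
print(summary(wine))    # min, quartiles, mean and max per column
```

On the ten-row sample the summaries will of course differ from the full-dataset numbers shown above.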
The first step is to plot the distribution of each variable with univariate plots, to understand the structure of the individual variables in our dataset.
fixed.acidity looks roughly normal with a slight positive skew.
volatile.acidity is also slightly positively skewed, so there might be a relationship with fixed.acidity.
citric.acid seems quite noisy, with very high counts at the 0 and 0.5 values.
residual.sugar is heavily positively skewed: most sugar values are very low.
We should put this distribution on a log scale to get a better look at it.
This looks more like a skewed normal distribution.
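To illustrate why a log scale helps with a positively skewed variable, here is a small sketch in base R. It uses only the first ten residual.sugar values from the `str()` output above, and `skew_indicator` is a hypothetical helper (a crude, scale-free comparison of mean and median), not a function from the original analysis. Plotting a histogram of `log10(x)` is used here as a stand-in for a log-scaled axis.

```r
# First ten residual.sugar values from the str() output above
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Crude skew indicator (hypothetical helper): how far the mean sits above
# the median, in units of standard deviation. Positive = right-skewed.
skew_indicator <- function(x) (mean(x) - median(x)) / sd(x)

skew_raw <- skew_indicator(residual.sugar)
skew_log <- skew_indicator(log10(residual.sugar))
print(c(raw = skew_raw, log10 = skew_log))

# Side-by-side histograms on the raw and the log10 scale
op <- par(mfrow = c(1, 2))
hist(residual.sugar, main = "Raw scale", xlab = "residual.sugar")
hist(log10(residual.sugar), main = "Log scale", xlab = "log10(residual.sugar)")
par(op)
```

The log transform pulls in the long right tail, so the bulk of the values is spread over more of the axis instead of being squeezed into the first few bins.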
Chlorides looks similar to residual.sugar, so there might be a relationship here. Let’s see how it looks on a log scale.
It doesn’t look much like residual.sugar anymore, so they are probably not related.
free.sulfur.dioxide is positively skewed, but doesn’t closely resemble any other distribution.
total.sulfur.dioxide has a strong positive skew.
This is interesting: density has an almost perfectly unskewed normal distribution.
The same goes for pH, but with a slight shift to the left. There could be a relationship between the two.
Sulphates is slightly positively skewed.
Alcohol is also positively skewed, but quite noisy. It looks slightly like free.sulfur.dioxide, so this should be investigated.
Quality has a roughly normal distribution with some noise, which makes sense given the small number of possible discrete values (0 through 10).
This Red Wine dataset contains 1,599 instances with 11 features describing the chemical properties of the wine, and 1 output, ‘quality’. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
There are no missing values. And it’s also worth noticing that the minimum quality wine has a score of 3, and the maximum has a score of 8. So there are no very bad or very good wines, and most wines lie in the middle.
The main feature is obviously the output quality. It would be interesting if we can find a correlation between the quality and one of the other features.
Probably pH. From our Univariate plots, we can see a high resemblance between the two features’ distributions. So there is probably a relationship there.
No. It didn’t seem like there was a need for a new variable.
Most distributions were positively skewed, which makes sense because most companies would probably try to minimize the chemical values, except for some outliers.
The alcohol distribution was probably the most unusual. It is also a bit similar to free.sulfur.dioxide, so there might be a relationship there.
Some transformations to a semi-log scale. Some plots had very small values that were clumped into a small area; transforming them to a log scale makes the data easier to visualize.
There are three things we should look into.
First, we should take a look at the relationship between each feature and the output feature.
Second, we should look into the relationships between features that had similar distributions.
Finally, we should look at the correlations between features, which would give us more insight that we can explore further with plots.
Here we plot every feature against the output feature and look at summary statistics. Box plots are the best way to represent these relationships, because they show a lot of information clearly for data grouped by category (in this case the quality rating). The red dots in the plots are the means.
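A minimal base R sketch of this kind of plot and summary, assuming a data frame with one feature column and a `quality` column. It uses the first ten rows of volatile.acidity and quality from the `str()` output above; the original report may well have used a different plotting library.

```r
# First ten rows of volatile.acidity and quality from the str() output above
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Summary statistics of the feature within each quality rating.
# tapply() returns one summary per rating, printed as $`5`, $`6`, ...
group_summaries <- tapply(wine$volatile.acidity, wine$quality, summary)
print(group_summaries)

# One box per quality rating, with the group means overlaid as red dots
group_means <- tapply(wine$volatile.acidity, wine$quality, mean)
boxplot(volatile.acidity ~ quality, data = wine,
        xlab = "quality", ylab = "volatile.acidity")
points(seq_along(group_means), group_means, col = "red", pch = 19)
```

Both `tapply()` and `boxplot()` order the groups by sorted quality rating, so the i-th mean lines up with the i-th box.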
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.225 12.600
No clear relationship between fixed.acidity and the quality.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
This is interesting. It seems that, in general, the lower the volatile.acidity, the higher the quality. So they are inversely related.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
Some trend here, but not very clear. In general, higher qualities tend to have higher citric.acid.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
No clear relationship is apparent.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
No clear relationship is apparent.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
This is a bit interesting: across the quality ratings the values rise and then fall, a bit like a bell curve. Wines with high free.sulfur.dioxide tend to be ‘average’, whereas the ones with low free.sulfur.dioxide are either ‘good’ or ‘bad’.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.50 27.00 35.02 43.00 289.00
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
This is similar to free.sulfur.dioxide, which makes sense because both measure sulfur dioxide: total.sulfur.dioxide probably includes free.sulfur.dioxide, so its distribution is affected by it.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9961 0.9976 0.9975 0.9988 1.0008
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9965 0.9965 0.9974 1.0010
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0031
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0037
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0032
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
Nothing obvious here. But using the summary, we can see that in general the lower the mean/median the higher the quality. It’s very subtle, though.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.280 3.291 3.380 3.780
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.163 3.230 3.267 3.350 3.720
Similar to density. It is not very obvious, but as the median and mean decrease, the quality increases.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
It seems that, in general, the higher the sulphates, the better the quality.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Similar to sulphates. Good quality wines seem to have higher values of alcohol.
In the Univariate Plots, we noticed some features having similar distributions. In this section we’ll plot these features against each other to see if there is a pattern.
The first two distributions that had a similar shape are residual.sugar and chlorides:
There seem to be a lot of outliers, so we should plot this on a log scale.
Much better. The points seem to be concentrated around the bottom-left. There are some outliers of course, but overall there doesn’t seem to be a relationship.
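A scatter plot with both axes on a log scale can be sketched in base R as follows. This is an illustration on the first ten rows from the `str()` output above, not the original plotting code; the correlation of the log-transformed values is added as a numeric cross-check of what the plot shows.

```r
# First ten rows of both variables from the str() output above
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
chlorides      <- c(0.076, 0.098, 0.092, 0.075, 0.076,
                    0.075, 0.069, 0.065, 0.073, 0.071)

# log = "xy" puts both axes on a log scale (all values must be positive)
plot(residual.sugar, chlorides, log = "xy",
     xlab = "residual.sugar (log scale)", ylab = "chlorides (log scale)")

# Pearson correlation of the log-transformed values as a numeric check;
# a value near 0 would agree with "no visible relationship"
r <- cor(log10(residual.sugar), log10(chlorides))
print(r)
```

Ten rows are far too few to say anything about the relationship itself; the sketch only shows the mechanics.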
Next up, alcohol and free.sulfur.dioxide:
The log scale doesn’t seem to change much. The plot is still scattered, with values everywhere. They don’t seem related.
Next, fixed.acidity and volatile.acidity:
The points are somewhat concentrated around the middle, but not strongly. This one is very scattered and spread out. There doesn’t seem to be a relationship.
Finally, density and pH:
Concentrated in the middle, but a bit scattered. No clear correlation.
This is interesting. Even though these distributions seemed similar in the univariate plots, that doesn’t mean there is a relationship!
volatile.acidity seemed to be inversely related to quality: in general, as quality went up, volatile.acidity went down. This makes sense, because volatile.acidity measures the amount of acetic acid in wine. According to the info provided with the dataset, high levels of acetic acid can lead to an unpleasant, vinegar taste.
The opposite was observed with citric.acid, however. Higher quality wines seem to have higher values of citric.acid. According to the info, citric acid can add ‘freshness’ and flavor to wines, which explains the higher ratings.
Also, alcohol was positively related to the rating. This was a bit surprising, as I thought people drank wine more for the taste than for the alcohol. It seems, however, that a higher concentration of alcohol corresponds to a higher rating in general.
There were multiple pairs of features that had similar distributions, but after investigation, there didn’t seem to be any correlation between them. I decided to look at the correlations anyway, and then investigated pairs with correlations higher than 0.6 (or lower than -0.6). These plots were more promising, as an obvious correlation was observed.
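Picking out the pairs with a correlation above 0.6 or below -0.6 can be sketched like this in base R. The data are the first ten rows of five columns from the `str()` output above, so the pairs and values printed will not match those of the full dataset; `threshold`, `cors` and `strong` are names chosen for this illustration.

```r
# First ten rows of five columns from the str() output above
wine <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

threshold <- 0.6
cors <- cor(wine)   # Pearson correlation matrix of all columns

# Look only at the upper triangle so each pair is reported once
# and the diagonal (always 1) is skipped
idx <- which(upper.tri(cors) & abs(cors) > threshold, arr.ind = TRUE)

strong <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)

# Strongest relationships first, regardless of sign
strong <- strong[order(-abs(strong$r)), ]
print(strong)
```

Each row of `strong` is then a candidate pair for a scatter plot.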
fixed.acidity is positively correlated with citric.acid and density. Both make sense: citric acid is an acid, and fixed acidity is a measure of acidity. Also, most acids are denser than water, so increasing the concentration of acids would increase the density.
fixed.acidity was also correlated with pH, but this time negatively. pH is a measure of acidity from 0 to 14, with 0 being the most acidic, so it makes sense that pH gets lower as fixed.acidity gets higher.
Also, total.sulfur.dioxide was positively correlated with free.sulfur.dioxide, which makes sense because free.sulfur.dioxide is a subset of total.sulfur.dioxide.
The strongest visible positive relationship was between fixed.acidity and density. The strongest negative one was between fixed.acidity and pH.
From the previous plots we saw how fixed.acidity was positively correlated with density and negatively correlated with pH. It would be nice to see that in one graph.
As expected, fixed.acidity increases as we move toward the bottom-right corner (higher density, lower pH). We log-scaled the y-axis rather than the x-axis because the x-axis has almost no variance: density ranges from about 0.99 to 1.00, so taking a log wouldn’t make any difference. Even so, almost no change is visible, because the y-axis variance is also very low.
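One way to show three variables in a single graph is to encode the third as point colour. The sketch below does this in base R with density on the x-axis, pH on a log-scaled y-axis, and fixed.acidity binned into a colour ramp. It uses only the first five rows, because the `str()` output above shows just five (rounded) density values; the palette colours and the number of bins are arbitrary choices for illustration.

```r
# First five rows from the str() output above (density is shown rounded there)
wine <- data.frame(
  density       = c(0.998, 0.997, 0.997, 0.998, 0.998),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4)
)

# Bin fixed.acidity into equal-width intervals and give each bin a colour
n_bins  <- 5
ramp    <- colorRampPalette(c("lightblue", "darkred"))(n_bins)
bins    <- cut(wine$fixed.acidity, breaks = n_bins, labels = FALSE)
colours <- ramp[bins]   # low acidity = light blue, high acidity = dark red

# density vs pH with a log-scaled y-axis, coloured by fixed.acidity bin
plot(wine$density, wine$pH, log = "y", col = colours, pch = 19,
     xlab = "density", ylab = "pH (log scale)")
```

With the full dataset, darker points gathering toward the bottom right would correspond to the pattern described above.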
Now let’s try to find out more interesting things.
In the prior graphs and analysis, it seemed like citric.acid was positively related to quality, while volatile.acidity was negatively related. I would like to see how quality is distributed as a function of these two variables.
Interesting! There is a pattern here. We see the concentration of good quality wines at the bottom right of the graph, where citric.acid is high and volatile.acidity is low, and the bad quality wines at the top left.
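A plot of this kind can be sketched in base R by colouring each point by its quality rating. The example uses the first ten rows from the `str()` output above and adds a per-rating table of means as a numeric companion to the picture; the colours are arbitrary, and the original figure may have been built differently.

```r
# First ten rows of the three variables from the str() output above
wine <- data.frame(
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Treat quality as a category and assign one colour per rating
q       <- factor(wine$quality)
pal     <- colorRampPalette(c("red", "darkgreen"))(nlevels(q))
colours <- pal[as.integer(q)]

# citric.acid on x, volatile.acidity on y, colour = quality rating
plot(wine$citric.acid, wine$volatile.acidity, col = colours, pch = 19,
     xlab = "citric.acid", ylab = "volatile.acidity")
legend("topright", legend = levels(q), col = pal, pch = 19, title = "quality")

# Mean of both features per quality rating
agg <- aggregate(cbind(citric.acid, volatile.acidity) ~ quality,
                 data = wine, FUN = mean)
print(agg)
```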
Alcohol seemed like an important feature of good wine too. So let’s see the distribution of quality against citric.acid and alcohol.
As expected. Higher levels of citric.acid and alcohol correspond to better quality in general.
Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The two most important features were alcohol and citric.acid. As we saw in the plots, higher levels of citric.acid and alcohol correspond to better quality in general.
It was interesting that as quality went up, both citric.acid and fixed.acidity went up as well, even though higher volatile.acidity goes with lower quality. This suggests that fixed.acidity is not heavily affected by volatile.acidity.
Plot of the distribution of volatile.acidity by quality. It shows that, in general, quality decreases as volatile.acidity increases.
This plot shows how fixed.acidity varies with both pH and density. The positive correlation with density and the negative correlation with pH are easy to see.
This plot shows how quality is affected by both alcohol and citric.acid. A general trend is seen where quality is low when both values are low, and highest when either is high.
The Red Wine dataset contains almost 1,600 rows of wine samples that have been tested by at least 3 wine experts. Each row contains 11 features that describe the chemical properties of the wine sample. The data was collected around the year 2009. I started by trying to understand each individual feature by looking at the distributions. As I gained some insight, I moved on and made plots using each feature to gain more information. I noticed some similarities between the distributions of some features, so I explored those relationships. Lastly, I explored the relationship between the quality feature and every other feature to try to find a pattern or an indication of a good wine, based on its chemical values.
There was an obvious relationship between Fixed Acidity and both Density and pH. This can be explained using chemistry, as most acids are denser than water and have low pH values by definition. There was also a fairly clear trend between Citric Acid concentration and quality. According to what the experts mentioned, Citric Acid naturally gives wine a feeling of ‘freshness’, so the correlation makes sense. However, I was surprised to find that alcohol is also positively correlated with good wines.
One limitation of this dataset is its size. 1,600 is not a large enough number to be a representative sample. Therefore, there might have been some bias towards certain types of wine that have specific kinds of features. Also, the dataset was collected in 2009. There are probably newer ways to make wine now, which might alter its components, so this dataset might not be a good representation of wine nowadays. In the future we should re-run this analysis on a larger dataset that is more representative of the population. The results and trends of this analysis could then be compared with those of the new one.